This is the project page for the Introduction to Open Data Science course of fall 2022. A link to the repository for this project page is here: IODS-project (github.com/teemuvh/IODS-project).
During this course, we get acquainted with R and certain data science methods, such as linear and logistic regression, clustering and classification algorithms, and dimensionality reduction models. We use the textbooks (1) Vehkalahti, Kimmo & Everitt, Brian S. (2019). Multivariate Analysis for the Behavioral Sciences and (2) Ewen Harrison and Riinu Pius (2021). R for Health Data Science.
During the first week, the focus is on getting everything up and running, and getting familiar with the syntax of R before moving to the more challenging topics of the course.
I use Python in my daily work, but so far I like how simple it is to read data into R.
# Here, we could load a csv-file to R as follows:
# data <- read_csv("some_data.csv")
# And view it as follows:
# View(data)
# But since we do not have "some_data" at our disposal,
# I have just added this as a comment.
# However, we can print the date:
date()
## [1] "Thu Dec 1 10:18:51 2022"
The Harrison & Pius book seems to work well for revising R syntax as well as learning some new tricks already for data transformations (mutate(), summarise(), pivot_wider(), factor(), etc.) The difficult part will be remembering when to utilise these functions…
In the Vehkalahti & Everitt textbook chapters 1 and 2, I liked how the same data were presented in different visualisation formats, which made me reflect a lot about how I could have presented some of my data before. I will attempt to use this information in the future to actually focus a bit more on what could be an optimal visualisation method for some given data that I work with. Additionally, it was fun to get to draw different visualisations with R, and again it seems simpler compared to Python. In terms of how the material works as a “crash course” to R, I think that having a background in programming helps in getting started with the material, since it makes the programming part a bit easier. I believe the material could work if one comes with no background in programming, but the learning curve is a bit steeper in the beginning.
It is nice to get to refresh my memory with the statistical methods as well. I have taken an introductory course to statistics but it was many years ago, so I have to recap/re-learn many of the topics. Thus, my favourite part of this week has been to actually refreshing my memory on these topics in statistics and I expect to learn a lot especially with respect to this field. I believe this information will be useful for me in the future in my own research.
Jump straight to the assignments.
date()
## [1] "Thu Dec 1 10:18:51 2022"
The core idea of regression analysis is to use statistical methods to assess the way in which a change in circumstances affects variation in an observed random variable. This section summarises my notes for the chapters on simple linear regression and multiple linear regression.
There are four basic assumptions in linear regression: linearity of the relationship between the variables, independence of the errors, constant variance of the errors (homoscedasticity), and normality of the errors.
The simple linear regression model (the intercept \(\beta_0\) is not always needed) has only a single explanatory variable (\(x\)) and a single dependent variable (\(y\)):
\[ y_i = \beta_0 + \beta_1 x_i + \epsilon_i \]
where \(\beta_0\) is the intercept, \(\beta_1\) is the slope of the linear relationship between the dependent variable and the explanatory variable, and \(\epsilon\) is an error term that measures the difference between the observed value and the prediction from the model.
The prediction of the model is presented without the error term as:
\[ \hat{y}_i = \hat{\beta}_0 + \hat{\beta}_1 x_i \]
This equation can then be used for predicting the value of a dependent variable for some explanatory variable.
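As a sketch of this in R (using simulated data, since no dataset has been loaded at this point; all names here are my own):

```r
# simulate data with known beta_0 = 2 and beta_1 = 0.5
set.seed(1)
x <- runif(50, 0, 10)
y <- 2 + 0.5 * x + rnorm(50)

# fit the simple linear regression and inspect the coefficient estimates
fit <- lm(y ~ x)
coef(fit)

# predict the dependent variable for a new explanatory value
predict(fit, newdata = data.frame(x = 7))
```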
The variability of the dependent variable is summarised by the Regression Mean Square (RGMS) and the Residual Mean Square (RMS):
\[ RGMS = \sum_{i=1}^{n} (\hat{y_i} - \overline{y})^2 \] \[ RMS = \frac{\sum_{i=1}^{n} (y_i - \hat{y_i})^2}{n-2} \]
We can assess our model by calculating the difference between an observed value and our prediction:
\[ \hat{\epsilon} = y - \hat{y} \]
R prints another measure for assessing how close the data are to the fitted line, R-squared:
\[ R^2 = 1 - \frac{\sum_{i=1}^{n}(y_i - \hat{y}_i)^2}{\sum_{i=1}^{n}(y_i - \overline{y})^2} \] where \(\overline{y}\) is the mean value of the sample. R-squared represents the proportion of the variance in the dependent variable that is explained by the explanatory variable. If R-squared is 0.0, the explanatory variable has no effect on the dependent variable; 1.0 indicates that all of the variability is explained by the model (i.e., the regression line fits perfectly). The adjusted R-squared includes a penalty term that lowers the value for less important explanatory variables.
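To make the formula concrete, R-squared can be computed by hand and compared to the value R reports (a sketch on simulated data; all names here are my own):

```r
# simulate data and fit a simple linear regression
set.seed(2)
x <- rnorm(40)
y <- 1 + 2 * x + rnorm(40)
fit <- lm(y ~ x)

rss <- sum((y - fitted(fit))^2)  # residual sum of squares
tss <- sum((y - mean(y))^2)      # total sum of squares
1 - rss / tss                    # matches summary(fit)$r.squared
```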
We should also use visualisation to assess the model; useful plots include residuals vs fitted values, a normal Q-Q plot of the residuals, and residuals vs leverage. Based on these, we can decide whether a linear model is appropriate or whether a non-linear model should be used.
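These diagnostic plots can be drawn directly from a fitted lm object; a minimal sketch using R's built-in cars data:

```r
# fit a simple model on the built-in cars dataset
model <- lm(dist ~ speed, data = cars)

plot(model, which = 1)  # residuals vs fitted values
plot(model, which = 2)  # normal Q-Q plot of the residuals
plot(model, which = 5)  # residuals vs leverage
```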
Here, we have more explanatory variables than in simple regression (e.g., analysing change in blood pressure based on coffee consumption in smokers and non-smokers). Basically, we have multiple parameters that affect the prediction.
\[ \hat{y} = \beta_0 + \beta_1 x_1 + \beta_2 x_2 \]
Confounding is a concept referring to a situation where another explanatory variable distorts the relationship between an explanatory variable and the outcome (e.g., smoking could be related to coffee consumption and to blood pressure).
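Confounding can be illustrated with simulated data (all variables here are hypothetical): below, smoking drives both coffee consumption and blood pressure, so coffee appears to matter until smoking is included in the model.

```r
set.seed(3)
smoking <- rbinom(200, 1, 0.5)
coffee  <- 2 + 1.5 * smoking + rnorm(200)
bp      <- 120 + 10 * smoking + rnorm(200, sd = 5)  # no direct coffee effect

coef(lm(bp ~ coffee))            # coffee seems associated with blood pressure
coef(lm(bp ~ coffee + smoking))  # adjusting for smoking shrinks the coffee coefficient towards 0
```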
The data wrangling is done, and the wrangled data is in my GitHub repo.
Task 1:
First, we need to read the data we created in the data wrangling exercise. The data is stored in the data directory. Then we quickly check the structure and the dimensions of the data.
learning2014 <- read.table("data/learning2014.csv", sep=",", header=T)
dim(learning2014)
## [1] 166 7
str(learning2014)
## 'data.frame': 166 obs. of 7 variables:
## $ gender : chr "F" "M" "F" "M" ...
## $ age : int 53 55 49 53 49 38 50 37 37 42 ...
## $ attitude: num 3.7 3.1 2.5 3.5 3.7 3.8 3.5 2.9 3.8 2.1 ...
## $ deep : num 3.58 2.92 3.5 3.5 3.67 ...
## $ stra : num 3.38 2.75 3.62 3.12 3.62 ...
## $ surf : num 2.58 3.17 2.25 2.25 2.83 ...
## $ points : int 25 12 24 10 22 21 21 31 24 26 ...
dim() outputs the dimensionality of the data ([166, 7]), whereas str() outputs, in addition to the number of observations and variables (i.e., the dimensions), a little more information about the data, including a few example observations for each of the seven variables.
The variables in the data are gender, age, attitude, deep, stra, surf, and points. In general, this data is about students’ attitudes towards studying statistics and how they feel about learning the topics on a statistics course. The variables indicate a student’s gender, age, and attitude towards statistics (averaged over 10 answers to questions on a Likert scale). From the link given in the material, I interpret that deep, strategic (stra), and surface (surf) indicate learning approaches, based on clusters of questions related to each approach. Points indicates the points a student got from the course exam. The dataset considered in this exercise contains these variables for 166 students.
Task 2:
# access the GGally and ggplot2 libraries
library(GGally)
## Loading required package: ggplot2
## Registered S3 method overwritten by 'GGally':
## method from
## +.gg ggplot2
library(ggplot2)
# create a plot matrix with ggpairs()
p <- ggpairs(learning2014, mapping = aes(col=gender, alpha=0.3), lower = list(combo = wrap("facethist", bins = 20)))
# draw the plot
p
I summarised the data into one scatterplot matrix. From this visualisation, we can observe the different dimensions and how they interact. Scatterplots are drawn on the left side of the diagonal, the diagonal depicts each variable’s distribution, and the right side of the matrix displays the Pearson correlations.
First, we see that the sample has quite a lot more female students than male. Most of the students in both groups were around 20-25 years old. Male students seem to have had a slightly more positive attitude towards statistics, having most of the mass in the middle of the Likert scale (around 3.5), but also more answers in the higher end of the scale compared to female students. Male students seem to have had a somewhat stronger tendency towards the deep learning approach, whereas female students seem to have been more prone to the strategic dimension. Female students also seem to have been more prone to the surface dimension. Points in the exam seem quite similarly distributed between the groups, although more male students acquired the highest points.
Most of the variables seem not to be correlated based on the Pearson correlation coefficient. However, a few of them are. Quite understandably, attitude seems to correlate with the points acquired.
Task 3:
Now, we look at multivariable regression. Let the first explanatory variable be attitude, the second stra, and the third surf. These variables were chosen based on their correlations with points, as indicated by the Pearson correlation coefficients in the visualisation above.
We can first visualise each variable’s relationship with the exam scores independently:
library(ggplot2)
qplot(attitude, points, data = learning2014) + geom_smooth(method = "lm")
## `geom_smooth()` using formula 'y ~ x'
library(ggplot2)
qplot(stra, points, data = learning2014) + geom_smooth(method = "lm")
## `geom_smooth()` using formula 'y ~ x'
library(ggplot2)
qplot(surf, points, data = learning2014) + geom_smooth(method = "lm")
## `geom_smooth()` using formula 'y ~ x'
Based on the three visualisations, it seems that both attitude and stra have a positive correlation with the exam results. Surf, unsurprisingly, seems to have a slightly negative correlation. Next, we fit a multivariable regression model to the given explanatory variables.
# create a regression model with multiple explanatory variables
multivariable_model <- lm(points ~ attitude + stra + surf, data = learning2014)
# summary
multivariable_model %>%
summary()
##
## Call:
## lm(formula = points ~ attitude + stra + surf, data = learning2014)
##
## Residuals:
## Min 1Q Median 3Q Max
## -17.1550 -3.4346 0.5156 3.6401 10.8952
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 11.0171 3.6837 2.991 0.00322 **
## attitude 3.3952 0.5741 5.913 1.93e-08 ***
## stra 0.8531 0.5416 1.575 0.11716
## surf -0.5861 0.8014 -0.731 0.46563
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 5.296 on 162 degrees of freedom
## Multiple R-squared: 0.2074, Adjusted R-squared: 0.1927
## F-statistic: 14.13 on 3 and 162 DF, p-value: 3.156e-08
The summary includes a five-point summary of the residuals (min, first quartile, median, third quartile, and max). If the model fits the data well, the distribution of the residuals should be symmetrical around zero.
Next, there is the coefficients block with values for estimate, Std. error, t-value, and p-value.
The estimate column indicates that a change of one point in attitude results in a change of approximately 3.4 points in the exam scores. For stra and surf, a one-point change corresponds to changes of approximately 0.9 and -0.6 points, respectively.
The Std. Error indicates the uncertainty of the coefficient estimates, i.e., how much the estimates would be expected to vary from sample to sample.
The t-value indicates the distance of our coefficient estimates from 0, measured in standard errors. This value can be positive or negative, but it should be far from 0: a large distance from 0 indicates a relationship between the given variables (e.g., attitude and points), whereas a value at or near 0 means that there is no relationship.
Finally, we have the p-value, the probability of observing a t-value at least as extreme as this one if the null hypothesis were true. This value should be small for us to be able to confidently reject the null hypothesis.
The significance codes are included in the output for a quicker read of the results.
Lastly, there are the residual standard error (RSE), multiple and adjusted R-squared, and the F-statistic. The RSE measures how well the model fits the data based on the residuals (\(y-\hat{y}\)). In practice, it is the square root of the residual sum of squares divided by the residual degrees of freedom (\(n - p - 1\), where \(p\) is the number of explanatory variables):
\[ RSE = \sqrt{\frac{1}{n-p-1}\sum_{i=1}^{n}(y_i - \hat{y_i})^2} \] R-squared (\(R^2\)) is another value to assess how well the model fits the data. In practice, R-squared represents the proportion of the variance in the dependent variable that is explained by the explanatory variables. If R-squared is 0.0, then the explanatory variables have no effect on the dependent variable. If the result is 1.0, all of the variability is explained by the model (i.e., the regression line fits perfectly). The adjusted R-squared includes a penalty term that lowers the value for less important explanatory variables.
R-squared is calculated as:
\[ R^2 = 1 - \frac{\sum_{i=1}^{n}(y - \hat{y})^2}{\sum_{i=1}^{n}(y - \overline{y})^2} \] where \(\overline{y}\) is the mean value of the sample.
The F-test indicates whether at least one explanatory variable is statistically significant, i.e., whether the model explains the data better than an intercept-only model.
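As a check of my reading of these quantities, they can be computed directly from the fitted model object and compared to the summary above:

```r
# residual standard error: sqrt(RSS / residual degrees of freedom)
rss <- sum(residuals(multivariable_model)^2)
sqrt(rss / df.residual(multivariable_model))

# multiple R-squared as reported by R
summary(multivariable_model)$r.squared
```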
The summary of the model indicates that only attitude has a significant correlation with the exam points, whereas the correlations between stra and points and between surf and points are not statistically significant. R-squared seems low to me; perhaps this is because we have so few data points to fit the model to?
Next, we remove the non-significant variables and fit the model again.
# create a regression model with a single explanatory variable
attitude_model <- lm(points ~ attitude, data = learning2014)
# summary
attitude_model %>%
summary()
##
## Call:
## lm(formula = points ~ attitude, data = learning2014)
##
## Residuals:
## Min 1Q Median 3Q Max
## -16.9763 -3.2119 0.4339 4.1534 10.6645
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 11.6372 1.8303 6.358 1.95e-09 ***
## attitude 3.5255 0.5674 6.214 4.12e-09 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 5.32 on 164 degrees of freedom
## Multiple R-squared: 0.1906, Adjusted R-squared: 0.1856
## F-statistic: 38.61 on 1 and 164 DF, p-value: 4.119e-09
Now, we only have the significant variable in the model.
Task 4:
Most of this task was done in the earlier task, but in short, a change of one point (on the Likert scale) in attitude results in a change of 3.5 points in the exam scores. R-squared was also explained above.
Task 5:
Finally, draw some visualisations of the residuals. First, I print the regression line of the attitude variable again, just to help in analysis:
library(ggplot2)
qplot(attitude, points, data = learning2014) + geom_smooth(method = "lm")
## `geom_smooth()` using formula 'y ~ x'
Next, I start outputting the different residual plots, beginning with Residuals vs Fitted.
# plot all plots:
# plot(attitude_model, which = c(1, 2, 5))
# but for the sake of the notebook, I plot these one by one
plot(attitude_model, which=1)
In the plot above, the residuals (y-axis) are plotted against the fitted values (x-axis). Ideally, the residuals should be spread evenly around the horizontal zero line. What we can read from the line is that the variance appears larger at the high end of the fitted values, meaning that the model might not fit well there. This is likely an issue of the small data size in this case.
plot(attitude_model, which=2)
The Q-Q plot shows that the residuals are not perfectly normally distributed but deviate at both tails. What I read from this is that there are values in both tails that are farther from the mean than would be expected under a normal distribution.
plot(attitude_model, which=5)
Residuals vs Leverage shows how influential different data points are for the regression model. The spread of the data points should not change as a function of leverage, but at the right end of the line it seems to change; perhaps this is again an issue of the small data size. There are no data points outside Cook’s distance, so there are no data points whose deletion from the data would have a high influence on the regression model.
Jump straight to the assignments.
date()
## [1] "Thu Dec 1 10:18:58 2022"
This part of the course focuses on logistic regression, a type of model where the output is categorical. For example, if we want to predict whether an email is spam or not, or whether a tumor is malignant or benign, we could use a logistic regression model. Instead of only two, we could also have multiple possible labels.
Linear regression models the population mean of the response variable directly as a linear function of the explanatory variables:
\[ E(y|x_1, x_2, ..., x_n) = \beta_0 + \beta_1x_1 + \beta_2x_2 + ... + \beta_nx_n \]
For classifying categorical outcomes this is not suitable; instead, we need to output something that we can interpret as probabilities of the different outcomes. The issue with linear regression models is that the output can be any real number between \(-\infty\) and \(\infty\). Thus, we need a link function, which here is the logarithm of the odds: \(log(\frac{\pi}{1-\pi})\). Logistic regression then takes the form:
\[ logit (\pi) = log(\frac{\pi}{1-\pi}) = \beta_0 + \beta_1x_1 + \beta_2x_2 + ... + \beta_nx_n \]
The logistic regression function rearranged for \(\pi\) is:
\[ \pi = \frac{ \exp (\beta_0 + \beta_1x_1 + \beta_2x_2 + ... + \beta_nx_n)}{1 + \exp (\beta_0 + \beta_1x_1 + \beta_2x_2 + ... + \beta_nx_n)} \] The observed variable \(y\) is represented as:
\[ y = \pi(x_1, x_2, ..., x_n) + \epsilon \] where, if \(y = 1\), \(\epsilon = 1 - \pi(x_1, x_2, ..., x_n)\), which occurs with probability \(\pi(x_1, x_2, ..., x_n)\); and if \(y = 0\), \(\epsilon = -\pi(x_1, x_2, ..., x_n)\), which occurs with probability \(1 - \pi(x_1, x_2, ..., x_n)\).
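The link function and its inverse are available in base R as qlogis() and plogis(); a small sketch:

```r
p <- 0.8
eta <- qlogis(p)           # the logit: log(0.8 / 0.2)
plogis(eta)                # the inverse link returns the probability 0.8
exp(eta) / (1 + exp(eta))  # the inverse link written out explicitly
```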
Data for the following analysis exercise here.
In this assignment, we will apply logistic regression to the alcohol consumption data created in the previous exercise. The original data is presented in Cortez & Silva, (2008).1
library(readr)
alc <- read_csv("data/alc.csv", show_col_types = FALSE)
library(dplyr)
##
## Attaching package: 'dplyr'
## The following objects are masked from 'package:stats':
##
## filter, lag
## The following objects are masked from 'package:base':
##
## intersect, setdiff, setequal, union
colnames(alc)
## [1] "school" "sex" "age" "address" "famsize"
## [6] "Pstatus" "Medu" "Fedu" "Mjob" "Fjob"
## [11] "reason" "guardian" "traveltime" "studytime" "schoolsup"
## [16] "famsup" "activities" "nursery" "higher" "internet"
## [21] "romantic" "famrel" "freetime" "goout" "Dalc"
## [26] "Walc" "health" "failures" "paid" "absences"
## [31] "G1" "G2" "G3" "alc_use" "high_use"
The wrangled data consists of 370 observations and 33 original variables. In practice, the observations are Portuguese secondary school students attending courses in mathematics and in Portuguese. The variables include student demographics, social and school related features, and grades. We have added another two variables, alc_use and high_use, where alc_use averages the weekday and weekend alcohol consumption, while high_use is a truth value indicating whether the student’s alcohol consumption is low (FALSE) or high (TRUE).
The goal of this exercise is to attempt to explain the relationship between different explanatory variables and alcohol consumption. As there are so many options, we limit the number of variables to be explored to four. There are many variables that I believe can explain alcohol consumption, but for this experiment, I will choose variables related to free-time activity. My hypothesis is that the more a student has ‘other’ activities (internet, romantic, activities), the less free time they have and the less they consume alcohol.
First, we recode the string-formatted yes/no answers as binary (0/1) factors.
library(tidyverse)
## ── Attaching packages ─────────────────────────────────────── tidyverse 1.3.2 ──
## ✔ tibble 3.1.8 ✔ stringr 1.4.1
## ✔ tidyr 1.2.1 ✔ forcats 0.5.2
## ✔ purrr 0.3.5
## ── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
## ✖ dplyr::filter() masks stats::filter()
## ✖ dplyr::lag() masks stats::lag()
library(finalfit)
alc <- alc %>%
mutate(internet.int = factor(internet) %>%
fct_recode("0" = "no",
"1" = "yes"),
romantic.int = factor(romantic) %>%
fct_recode("0" = "no",
"1" = "yes"),
activities.int = factor(activities) %>%
fct_recode("0" = "no",
"1" = "yes"),
)
Next, let’s visualise the variables with respect to alcohol consumption.
# access the tidyverse packages dplyr and ggplot2
library(dplyr); library(ggplot2)
g1 <- ggplot(data = alc, aes(x = high_use, fill = romantic.int))
g1 + geom_bar() + facet_wrap("romantic")
g2 <- ggplot(data = alc, aes(x = high_use, fill = internet.int))
g2 + geom_bar() + facet_wrap("internet")
g3 <- ggplot(data = alc, aes(x = high_use, fill = activities.int))
g3 + geom_bar() + facet_wrap("activities")
g4 <- ggplot(data = alc, aes(x = high_use, fill = freetime))
g4 + geom_bar() + facet_wrap("freetime")
Based on the barplots, it seems that students who are in a romantic relationship are less prone to high alcohol usage. Internet, however, does not show a similar effect: students who have an internet connection at home actually seem to consume more alcohol. This is probably explained by the fact that few households are without an internet connection.
library(dplyr)
alc %>%
group_by(internet.int) %>%
summarise(percent = 100 * n() / nrow(alc))
## # A tibble: 2 × 2
## internet.int percent
## <fct> <dbl>
## 1 0 15.4
## 2 1 84.6
As it seems, approximately 85% of the students’ households have an internet connection.
Extracurricular activities seem to not have a large effect on alcohol consumption, but perhaps students who have no extracurricular activities might be slightly more prone to high alcohol consumption.
Lastly, it seems that the more free time students have, the smaller the gap between high and low alcohol consumption.
Next, I use logistic regression to analyse the variables further, and see whether my hypothesis really holds.
m <- glm(high_use ~ romantic.int + internet.int + activities.int + freetime, data = alc, family = "binomial")
summary(m)
##
## Call:
## glm(formula = high_use ~ romantic.int + internet.int + activities.int +
## freetime, family = "binomial", data = alc)
##
## Deviance Residuals:
## Min 1Q Median 3Q Max
## -1.2100 -0.8970 -0.7151 1.3130 1.9156
##
## Coefficients:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) -2.0503 0.5010 -4.093 4.26e-05 ***
## romantic.int1 -0.1777 0.2510 -0.708 0.47882
## internet.int1 0.1793 0.3320 0.540 0.58925
## activities.int1 -0.3529 0.2329 -1.515 0.12983
## freetime 0.3895 0.1225 3.179 0.00148 **
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for binomial family taken to be 1)
##
## Null deviance: 452.04 on 369 degrees of freedom
## Residual deviance: 439.01 on 365 degrees of freedom
## AIC: 449.01
##
## Number of Fisher Scoring iterations: 4
Based on the logistic regression, it seems that only freetime is statistically significantly related to high alcohol consumption.
library(broom)
m %>%
tidy(conf.int = TRUE, exp = TRUE)
## # A tibble: 5 × 7
## term estimate std.error statistic p.value conf.low conf.high
## <chr> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
## 1 (Intercept) 0.129 0.501 -4.09 0.0000426 0.0466 0.334
## 2 romantic.int1 0.837 0.251 -0.708 0.479 0.508 1.36
## 3 internet.int1 1.20 0.332 0.540 0.589 0.635 2.35
## 4 activities.int1 0.703 0.233 -1.51 0.130 0.444 1.11
## 5 freetime 1.48 0.123 3.18 0.00148 1.17 1.89
Above, we have the coefficients as odds ratios (estimate) and the confidence intervals (conf.low and conf.high) for the different variables. Since the only statistically significant explanatory variable is the amount of free time, I mostly focus on that. The coefficient, approx. 1.48, indicates that moving one step up in freetime, on a scale from 1 (very low) to 5 (very high), multiplies the odds of high alcohol consumption by 1.48. Equivalently, an increase in freetime by one point increases the odds of high alcohol consumption by 48%. The coefficients below 1 in the tidy output correspond to negative log-odds coefficients and indicate that these explanatory variables (romantic relationship and extracurricular activities) have a decreasing effect on the odds of high alcohol consumption, i.e., if a student is in a romantic relationship or has extracurricular activities, they are less prone to high alcohol consumption.
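As a quick sanity check, the odds ratios in the tidy output are simply the exponentiated model coefficients:

```r
# exponentiate the log-odds coefficients of the fitted model
exp(coef(m))
```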
m <- glm(high_use ~ freetime, data = alc, family = "binomial")
alc <- mutate(alc, probability = predict(m, type = "response"))
alc <- mutate(alc, prediction = probability > 0.5)
table(high_use = alc$high_use, prediction = alc$prediction)
## prediction
## high_use FALSE
## FALSE 259
## TRUE 111
It seems that the model fitted only on freetime always predicts FALSE. We can calculate our model’s accuracy from the confusion matrix by:
\[ acc = \frac{(TP + TN)}{(P + N)} \]
For our model, that would result in \(acc = \frac{259}{370} = 0.7\)
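The same accuracy can be computed directly as the proportion of correct predictions:

```r
# accuracy = (TP + TN) / (P + N)
mean(alc$high_use == alc$prediction)
```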
# define a loss function (mean prediction error)
loss_func <- function(class, prob) {
n_wrong <- abs(class - prob) > 0.5
mean(n_wrong)
}
# call loss_func to compute the average number of wrong predictions in the (training) data
loss_func(class = alc$high_use, prob = alc$probability)
## [1] 0.3
Our loss function returns a loss of 0.3, indicating that, on average, we predict the wrong label 30% of the time.
library(dplyr)
alc %>%
group_by(high_use) %>%
summarise(percent = 100 * n() / nrow(alc))
## # A tibble: 2 × 2
## high_use percent
## <lgl> <dbl>
## 1 FALSE 70
## 2 TRUE 30
The data consists of 70% FALSE labels. Thus, if we always guessed the majority class, as our model actually does, our baseline accuracy would be 0.7. Any useful model should therefore obtain an accuracy higher than that.
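The majority-class baseline can also be computed directly from the label distribution:

```r
# proportion of the most common label = accuracy of always guessing it
max(prop.table(table(alc$high_use)))
```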
Bonus exercise 1
m2 <- glm(high_use ~ sex + failures + absences + famrel + goout + health, data = alc, family = "binomial")
alc <- mutate(alc, probability = predict(m2, type = "response"))
alc <- mutate(alc, prediction = probability > 0.5)
# K-fold cross-validation
library(boot)
cv <- cv.glm(data = alc, cost = loss_func, glmfit = m2, K = 10)
# average number of wrong predictions in the cross validation
cv$delta[1]
## [1] 0.2081081
I can get a loss of approximately 0.2 in cross-validation by adding sex, failures, absences, famrel, goout, and health to the model features.
date()
## [1] "Thu Dec 1 10:19:01 2022"
This chapter focuses on a set of clustering methods designed for visualizing and exploring statistical data. In general, we want to train a model that can assign data points to different clusters (or groups) based on their characteristics. After the model is trained, it can be used to classify unseen data into the learnt clusters. The most popular of these methods is k-means clustering.
library(MASS)
##
## Attaching package: 'MASS'
## The following object is masked from 'package:dplyr':
##
## select
# load the data
data("Boston")
# explore the dataset
str(Boston)
## 'data.frame': 506 obs. of 14 variables:
## $ crim : num 0.00632 0.02731 0.02729 0.03237 0.06905 ...
## $ zn : num 18 0 0 0 0 0 12.5 12.5 12.5 12.5 ...
## $ indus : num 2.31 7.07 7.07 2.18 2.18 2.18 7.87 7.87 7.87 7.87 ...
## $ chas : int 0 0 0 0 0 0 0 0 0 0 ...
## $ nox : num 0.538 0.469 0.469 0.458 0.458 0.458 0.524 0.524 0.524 0.524 ...
## $ rm : num 6.58 6.42 7.18 7 7.15 ...
## $ age : num 65.2 78.9 61.1 45.8 54.2 58.7 66.6 96.1 100 85.9 ...
## $ dis : num 4.09 4.97 4.97 6.06 6.06 ...
## $ rad : int 1 2 2 3 3 3 5 5 5 5 ...
## $ tax : num 296 242 242 222 222 222 311 311 311 311 ...
## $ ptratio: num 15.3 17.8 17.8 18.7 18.7 18.7 15.2 15.2 15.2 15.2 ...
## $ black : num 397 397 393 395 397 ...
## $ lstat : num 4.98 9.14 4.03 2.94 5.33 ...
## $ medv : num 24 21.6 34.7 33.4 36.2 28.7 22.9 27.1 16.5 18.9 ...
dim(Boston)
## [1] 506 14
The Boston dataset represents different explanatory variables related to housing values in the suburbs of Boston. The data consists of 506 observations and 14 variables. In practice, the variables are ones that can be assumed to impact housing values, e.g., crime rate, air quality, average number of rooms, median value of owner-occupied homes, etc. More information about the data can be found here.
# plot matrix of the variables
pairs(Boston)
The plot matrix gives a graphical overview of the data. However, this visualization is rather difficult to read; a correlation matrix provides a slightly clearer visualization.
library(corrplot)
## corrplot 0.92 loaded
# calculate the correlation matrix and round it
cor_matrix <- cor(Boston)
# visualize the correlation matrix
corrplot(cor_matrix, method="circle")
From the correlation matrix, we can interpret the relationships between different variables. For instance, it seems that the proportion of non-retail business acres per town, indicated by indus, has a rather high negative correlation with the weighted mean of distances to five Boston employment centres, indicated by dis. Unsurprisingly, indus also seems to have a high positive correlation with nitrogen oxides concentration (parts per 10 million), indicated by nox. The median value of owner-occupied homes (medv) seems to correlate positively with the number of rooms (rm), and negatively with the lower status of the population (lstat).
What I think we can read from this data is, for instance, that in areas where the median value of homes (medv) is high, we see less crime, fewer non-retail businesses, better air quality, more rooms in houses, fewer pupils per teacher, etc., based on the correlations between the variables.
Summaries of the variables:
# summaries of the variables of the data
summary(Boston)
## crim zn indus chas
## Min. : 0.00632 Min. : 0.00 Min. : 0.46 Min. :0.00000
## 1st Qu.: 0.08205 1st Qu.: 0.00 1st Qu.: 5.19 1st Qu.:0.00000
## Median : 0.25651 Median : 0.00 Median : 9.69 Median :0.00000
## Mean : 3.61352 Mean : 11.36 Mean :11.14 Mean :0.06917
## 3rd Qu.: 3.67708 3rd Qu.: 12.50 3rd Qu.:18.10 3rd Qu.:0.00000
## Max. :88.97620 Max. :100.00 Max. :27.74 Max. :1.00000
## nox rm age dis
## Min. :0.3850 Min. :3.561 Min. : 2.90 Min. : 1.130
## 1st Qu.:0.4490 1st Qu.:5.886 1st Qu.: 45.02 1st Qu.: 2.100
## Median :0.5380 Median :6.208 Median : 77.50 Median : 3.207
## Mean :0.5547 Mean :6.285 Mean : 68.57 Mean : 3.795
## 3rd Qu.:0.6240 3rd Qu.:6.623 3rd Qu.: 94.08 3rd Qu.: 5.188
## Max. :0.8710 Max. :8.780 Max. :100.00 Max. :12.127
## rad tax ptratio black
## Min. : 1.000 Min. :187.0 Min. :12.60 Min. : 0.32
## 1st Qu.: 4.000 1st Qu.:279.0 1st Qu.:17.40 1st Qu.:375.38
## Median : 5.000 Median :330.0 Median :19.05 Median :391.44
## Mean : 9.549 Mean :408.2 Mean :18.46 Mean :356.67
## 3rd Qu.:24.000 3rd Qu.:666.0 3rd Qu.:20.20 3rd Qu.:396.23
## Max. :24.000 Max. :711.0 Max. :22.00 Max. :396.90
## lstat medv
## Min. : 1.73 Min. : 5.00
## 1st Qu.: 6.95 1st Qu.:17.02
## Median :11.36 Median :21.20
## Mean :12.65 Mean :22.53
## 3rd Qu.:16.95 3rd Qu.:25.00
## Max. :37.97 Max. :50.00
Next, we standardize the values with:
\[ scaled(x) = \frac{x-mean(x)}{std(x)} \]
# center and standardize variables
boston_scaled <- scale(Boston)
# summaries of the scaled variables
summary(boston_scaled)
## crim zn indus chas
## Min. :-0.419367 Min. :-0.48724 Min. :-1.5563 Min. :-0.2723
## 1st Qu.:-0.410563 1st Qu.:-0.48724 1st Qu.:-0.8668 1st Qu.:-0.2723
## Median :-0.390280 Median :-0.48724 Median :-0.2109 Median :-0.2723
## Mean : 0.000000 Mean : 0.00000 Mean : 0.0000 Mean : 0.0000
## 3rd Qu.: 0.007389 3rd Qu.: 0.04872 3rd Qu.: 1.0150 3rd Qu.:-0.2723
## Max. : 9.924110 Max. : 3.80047 Max. : 2.4202 Max. : 3.6648
## nox rm age dis
## Min. :-1.4644 Min. :-3.8764 Min. :-2.3331 Min. :-1.2658
## 1st Qu.:-0.9121 1st Qu.:-0.5681 1st Qu.:-0.8366 1st Qu.:-0.8049
## Median :-0.1441 Median :-0.1084 Median : 0.3171 Median :-0.2790
## Mean : 0.0000 Mean : 0.0000 Mean : 0.0000 Mean : 0.0000
## 3rd Qu.: 0.5981 3rd Qu.: 0.4823 3rd Qu.: 0.9059 3rd Qu.: 0.6617
## Max. : 2.7296 Max. : 3.5515 Max. : 1.1164 Max. : 3.9566
## rad tax ptratio black
## Min. :-0.9819 Min. :-1.3127 Min. :-2.7047 Min. :-3.9033
## 1st Qu.:-0.6373 1st Qu.:-0.7668 1st Qu.:-0.4876 1st Qu.: 0.2049
## Median :-0.5225 Median :-0.4642 Median : 0.2746 Median : 0.3808
## Mean : 0.0000 Mean : 0.0000 Mean : 0.0000 Mean : 0.0000
## 3rd Qu.: 1.6596 3rd Qu.: 1.5294 3rd Qu.: 0.8058 3rd Qu.: 0.4332
## Max. : 1.6596 Max. : 1.7964 Max. : 1.6372 Max. : 0.4406
## lstat medv
## Min. :-1.5296 Min. :-1.9063
## 1st Qu.:-0.7986 1st Qu.:-0.5989
## Median :-0.1811 Median :-0.1449
## Mean : 0.0000 Mean : 0.0000
## 3rd Qu.: 0.6024 3rd Qu.: 0.2683
## Max. : 3.5453 Max. : 2.9865
# change the object to data frame
boston_scaled <- as.data.frame(boston_scaled)
Next, we use the scaled crime-rate values to create quantile bins and store the crime rate as a categorical variable (a four-level scale from low to high crime rate). Furthermore, we drop the old crime rate from the data and replace it with the new categorical variable.
boston_scaled$crim <- as.numeric(boston_scaled$crim)
# create a quantile vector of crim and print it
bins <- quantile(boston_scaled$crim)
bins
## 0% 25% 50% 75% 100%
## -0.419366929 -0.410563278 -0.390280295 0.007389247 9.924109610
# create a categorical variable 'crime'
crime <- cut(boston_scaled$crim,
breaks = bins,
include.lowest = TRUE,
label = c("low", "med_low", "med_high", "high"))
# look at the table of the new factor crime
table(crime)
## crime
## low med_low med_high high
## 127 126 126 127
# remove original crim from the dataset
boston_scaled <- dplyr::select(boston_scaled, -crim)
# add the new categorical value to scaled data
boston_scaled <- data.frame(boston_scaled, crime)
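The quantile-binning step above can be mimicked in numpy; the crime values here are randomly generated stand-ins, not the Boston data. `np.digitize(..., right=True)` corresponds to cut()'s default right-closed intervals with `include.lowest = TRUE`:

```python
import numpy as np

# Hypothetical scaled crime-rate values (506 rows, like Boston).
crim = np.random.default_rng(13).normal(size=506)

# Quantile breaks, as in R's quantile(boston_scaled$crim).
bins = np.quantile(crim, [0.0, 0.25, 0.5, 0.75, 1.0])

# Assign each value to one of four labelled bins using the three
# inner breaks; right=True makes the intervals right-closed.
labels = ["low", "med_low", "med_high", "high"]
idx = np.digitize(crim, bins[1:-1], right=True)
crime = [labels[i] for i in idx]

print([crime.count(l) for l in labels])  # [127, 126, 126, 127]
```

The counts per bin come out as 127/126/126/127 for 506 observations, matching the table() output above.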
Before we can train a classification model such as linear discriminant analysis (LDA), or any other ML technique, we should split the data into training and test sets. Here, we use 80% of the data for training and 20% for testing. For tuning the model we might also want a development set, making the split 80-10-10, but for now we stick with 80-20. From now on, each row in the data (observations of 14 variables) becomes a 13-dimensional vector that represents one data point for our model; the output predicted from that vector is the crime-rate class (low, med_low, med_high, or high).
# number of rows in the Boston dataset
n <- nrow(boston_scaled)
# choose randomly 80% of the rows
ind <- sample(n, size = n * 0.8)
# create train set
train <- boston_scaled[ind,]
# create test set
test <- boston_scaled[-ind,]
# save the correct classes from test data
correct_classes <- test$crime
# remove the crime variable from test data
test <- dplyr::select(test, -crime)
nrow(train); nrow(test);
## [1] 404
## [1] 102
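For comparison, the same random 80-20 split can be sketched in Python; the index sampling mirrors R's `sample(n, size = n * 0.8)` (which truncates 404.8 to 404):

```python
import numpy as np

rng = np.random.default_rng(0)
n = 506  # rows in the Boston data

# Sample 80% of the row indices without replacement for training;
# the remaining rows form the test set.
train_idx = rng.choice(n, size=int(n * 0.8), replace=False)
test_idx = np.setdiff1d(np.arange(n), train_idx)

print(len(train_idx), len(test_idx))  # 404 102
```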
Next, we fit the LDA model to the training data.
# LDA (lda() comes from the MASS package)
library(MASS)
lda.fit <- lda(crime ~ ., data = train)
# the function for lda biplot arrows
lda.arrows <- function(x, myscale = 1, arrow_heads = 0.1, color = "red", tex = 0.75, choices = c(1,2)){
heads <- coef(x)
arrows(x0 = 0, y0 = 0,
x1 = myscale * heads[,choices[1]],
y1 = myscale * heads[,choices[2]], col=color, length = arrow_heads)
text(myscale * heads[,choices], labels = row.names(heads),
cex = tex, col=color, pos=3)
}
# target classes as numeric (used as plotting colors and symbols)
classes <- as.numeric(train$crime)
# plot the lda results
plot(lda.fit, dimen = 2, col = classes, pch = classes)
lda.arrows(lda.fit, myscale = 2)
In the 2-dimensional visualization of the linear discriminant analysis, the crime classes form rather clear clusters: high crime rates sit in the right part of the space, med_high at the bottom, and low at the top left. Med_high and med_low form a larger, somewhat overlapping cloud that is not clearly separated, likely because their values were also close to each other when we generated the crime-rate bins. Additionally, the biplot arrows show how the variables relate to the discriminants: the model associates the index of accessibility to radial highways (rad) quite strongly with a high crime rate, whereas the proportion of residential land zoned for lots over 25,000 sq.ft. (zn) is quite strongly associated with a low crime rate.
Let us then see how the LDA model generalizes to unseen data.
# predict classes with test data
lda.pred <- predict(lda.fit, newdata = test)
# cross tabulate the results
predictions <- lda.pred$class
table(correct = correct_classes, predicted = predictions)
## predicted
## correct low med_low med_high high
## low 12 11 1 0
## med_low 6 17 4 0
## med_high 0 11 19 0
## high 0 0 0 21
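From the cross-tabulation we can also compute an overall accuracy; here is a quick check in Python using the counts from the table above (note that since the split was random and unseeded, a re-run of the R code would give somewhat different numbers):

```python
import numpy as np

# The cross-tabulation above: rows = correct class, columns = predicted.
conf = np.array([
    [12, 11,  1,  0],   # low
    [ 6, 17,  4,  0],   # med_low
    [ 0, 11, 19,  0],   # med_high
    [ 0,  0,  0, 21],   # high
])

# Overall accuracy: correct predictions (the diagonal) over all 102
# test observations.
accuracy = np.trace(conf) / conf.sum()
print(round(accuracy, 3))  # 0.676
```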
The model does not seem very robust: it often predicts the wrong label, especially when the correct label is med_low. On the other hand, the middle categories are also the hardest examples to predict correctly.
Finally, let us calculate the distances between the observations, and run k-means clustering.
library(MASS)
data("Boston")
boston_scaled <- as.data.frame(scale(Boston))
# euclidean distance
dist <- dist(boston_scaled, method = "euclidean")
# look at the summary of the distances
summary(dist)
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 0.1343 3.4625 4.8241 4.9111 6.1863 14.3970
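As a small aside, the Euclidean distances that dist() computes pairwise can be sketched in numpy with broadcasting; the three observations below are made up to give easy-to-check 3-4-5 style distances:

```python
import numpy as np

# Three hypothetical standardized observations with two variables each.
X = np.array([[0.0, 0.0],
              [3.0, 4.0],
              [6.0, 8.0]])

# Pairwise Euclidean distance matrix, the same quantity R's
# dist(..., method = "euclidean") returns (as a lower triangle).
diff = X[:, None, :] - X[None, :, :]
D = np.sqrt((diff ** 2).sum(axis=-1))

print(D[0, 1], D[0, 2])  # 5.0 10.0
```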
K-means
set.seed(13)
# k-means clustering with 4 clusters (on the scaled data)
km <- kmeans(boston_scaled, centers = 4)
# plot the Boston dataset with clusters
pairs(boston_scaled, col = km$cluster)
The clusters are not very visible from such a small scatter plot matrix.
pairs(boston_scaled[6:10], col = km$cluster)
Here we notice that some of the clusters lie on top of each other, and the number of clusters is potentially too high. Let's attempt to find the optimal number of clusters.
set.seed(123)
# determine the number of clusters
k_max <- 10
# calculate the total within sum of squares
twcss <- sapply(1:k_max, function(k){kmeans(boston_scaled, k)$tot.withinss})
# visualize the results
qplot(x = 1:k_max, y = twcss, geom = 'line')
Above, the total within-cluster sum of squares (WCSS) gives us a way to choose the number of clusters. In practice, the optimal number is where the WCSS drops sharply and then levels off (the "elbow" of the curve). From the visualization, we can approximate that the optimal number of clusters is 2, because the value drops considerably up to that point.
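To make the WCSS quantity concrete, here is a minimal numpy sketch of the objective that `kmeans()$tot.withinss` reports (the clustering itself is not re-implemented; labels are given by hand, and the two tight, well-separated clusters are made up):

```python
import numpy as np

def tot_withinss(X, labels):
    """Total within-cluster sum of squared deviations from each
    cluster's centroid -- R's kmeans()$tot.withinss."""
    total = 0.0
    for k in np.unique(labels):
        members = X[labels == k]
        total += ((members - members.mean(axis=0)) ** 2).sum()
    return total

# Two tight, well-separated hypothetical clusters on one variable.
X = np.array([[0.0], [0.1], [10.0], [10.1]])
one = tot_withinss(X, np.zeros(4, dtype=int))   # everything in one cluster
two = tot_withinss(X, np.array([0, 0, 1, 1]))   # the "correct" two clusters

# Moving from 1 to 2 clusters collapses the WCSS -- the elbow.
print(one > 100 * two)  # True
```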
# New k-means with 2 clusters
km <- kmeans(boston_scaled, centers = 2)
# plot the scaled Boston dataset with 2 clusters
pairs(boston_scaled, col = km$cluster)
pairs(boston_scaled[6:10], col = km$cluster)
Now the clusters are perhaps more visible and, hopefully, better separated in the variable space.
This week's material focuses on a set of statistical methods that fall under the term dimensionality reduction. In practice, dimensionality reduction is a way to recognize and visualize the dimensions that carry the most information in multidimensional data, preserving as much of the variance of the original variables as possible. One of the best-known methods is Principal Component Analysis (PCA). After we have identified the principal components through a few matrix transformations, we can observe the data in a 2-dimensional space, which would not be possible without dimensionality reduction.
Another method we focus on is multiple correspondence analysis (MCA), with which we can find suitable transformations from categorical variables to continuous scales and then reduce the dimensions, as in PCA, for visualization purposes.
Task 5: Data wrangling
Done, and the assignment is in the latter part of this file.
Task 5: Analysis
library(dplyr)
human <- read.table("data/human.txt", sep = ",", header = T)
Now that we have wrangled the "human" data, we can start working on the analysis part. The data consists of 155 observations and 8 explanatory variables. These observations are designed to capture a wider representation of the development of a country than merely looking at economic growth would do. Thus, the data includes variables related to gender inequality, life expectancy, education, et cetera. More information about the data can be obtained from here. The figure below provides a graphical overview of the data.
library(GGally)
ggpairs(human)
summary(human)
## edu.fm labo.fm edu.exp life.exp
## Min. :0.1717 Min. :0.1857 Min. : 5.40 Min. :49.00
## 1st Qu.:0.7264 1st Qu.:0.5984 1st Qu.:11.25 1st Qu.:66.30
## Median :0.9375 Median :0.7535 Median :13.50 Median :74.20
## Mean :0.8529 Mean :0.7074 Mean :13.18 Mean :71.65
## 3rd Qu.:0.9968 3rd Qu.:0.8535 3rd Qu.:15.20 3rd Qu.:77.25
## Max. :1.4967 Max. :1.0380 Max. :20.20 Max. :83.50
## gni maternal.mortality.ratio ado.birth.rate parli.f
## Min. : 581 Min. : 1.0 Min. : 0.60 Min. : 0.00
## 1st Qu.: 4198 1st Qu.: 11.5 1st Qu.: 12.65 1st Qu.:12.40
## Median : 12040 Median : 49.0 Median : 33.60 Median :19.30
## Mean : 17628 Mean : 149.1 Mean : 47.16 Mean :20.91
## 3rd Qu.: 24512 3rd Qu.: 190.0 3rd Qu.: 71.95 3rd Qu.:27.95
## Max. :123124 Max. :1100.0 Max. :204.80 Max. :57.50
From the above visualization, we can observe how the variables interact with each other in terms of the Pearson correlation coefficient (right of the diagonal). For instance, if we consider adolescent birth rate (ado.birth.rate), we can easily see that it correlates strongly with the ratio of female to male education (edu.fm), as well as with expected years of education (edu.exp), life expectancy (life.exp), GNI, and maternal mortality (maternal.mortality.ratio). From the scatter plots (left of the diagonal), we can observe some correlation among the data points. For example, expected years of education (edu.exp) seems to correlate positively with life expectancy (life.exp), and negatively with maternal mortality.
This information is perhaps more easily visible from a heat map of correlations, shown in the figure below. In the heat map, the darker the color, the stronger the correlation between the variables; blue indicates positive correlation, red negative. If we look at the variables discussed above, we notice that adolescent birth rate correlates rather strongly with the ratio of education (edu.fm), and even more with life expectancy, for instance. Expected years of education has a strong positive correlation with life expectancy and a strong negative correlation with maternal mortality. This suggests that education goes hand in hand with higher life expectancy and lower maternal mortality.
library(corrplot)
cor(human) %>%
corrplot(method="color")
Now that we have summarized and observed the data a little, we can move to dimensionality reduction.
PCA uses the singular value decomposition (SVD) to find the principal components of the data. The first component explains the largest share of the variance; the second component, orthogonal to the first, explains the largest share of the variance remaining after the first component has been removed. This way we obtain two dimensions for plotting: principal component 1 and principal component 2.
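The SVD route to PCA can be sketched in a few lines of numpy; the data below are randomly generated (with one column deliberately made correlated with another), so this only illustrates the mechanics of what prcomp() does, not the human data:

```python
import numpy as np

rng = np.random.default_rng(1)
X = rng.normal(size=(100, 3))
X[:, 2] = 2 * X[:, 0] + rng.normal(scale=0.1, size=100)  # correlated column

# Standardize, as prcomp(..., scale = TRUE) does.
Z = (X - X.mean(axis=0)) / X.std(axis=0, ddof=1)

# SVD of the standardized matrix; squared singular values divided by
# (n - 1) give the variance along each principal component.
U, s, Vt = np.linalg.svd(Z, full_matrices=False)
var = s ** 2 / (Z.shape[0] - 1)
prop = var / var.sum()  # proportion of variance per component

print(prop.sum().round(6))  # 1.0
```

The rows of `Vt` are the component directions (prcomp's rotation matrix, transposed), and `prop` corresponds to the "Proportion of Variance" row in summary(prcomp(...)).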
# perform principal component analysis (with the SVD method)
pca_human <- prcomp(human, scale = FALSE)
pca_human_std <- prcomp(human, scale = TRUE)
sh <- summary(pca_human)
sh_std <- summary(pca_human_std)
pca_pr_sh <- round(100*sh$importance[2, ], digits = 1)
pca_pr_sh_std <- round(100*sh_std$importance[2, ], digits = 1)
pc_lab_sh <- paste0(names(pca_pr_sh), " (", pca_pr_sh, "%)")
pc_lab_sh_std <- paste0(names(pca_pr_sh_std), " (", pca_pr_sh_std, "%)")
# draw a biplot of the principal component representation and the original variables
biplot(pca_human,
choices = 1:2,
cex = c(0.5, 0.5),
col = c("grey40", "deeppink2"),
xlab = pc_lab_sh[1], ylab = pc_lab_sh[2]
)
## Warning in arrows(0, 0, y[, 1L] * 0.8, y[, 2L] * 0.8, col = col[2L], length =
## arrow.len): zero-length arrow is of indeterminate angle and so skipped
(the warning above was emitted five times)
biplot(pca_human_std,
choices = 1:2,
cex = c(0.5, 0.5),
col = c("grey40", "deeppink2"),
xlab = pc_lab_sh_std[1], ylab = pc_lab_sh_std[2],
)
From the visualizations above, we can observe that it is necessary to standardize the data before applying PCA. Since PCA maximizes variance, with non-standardized data a variable measured on a large scale (such as GNI here) can appear to dictate the variance, while the contributions of the other variables are drowned out (first figure). After standardizing the data, the model can actually find the components that maximize the variance across all variables (second figure). This is also suggested by the PCA summaries below: based on the proportion of variance explained, the first principal component explains practically all of the variance in the non-standardized data, whereas in the standardized data the first principal component explains slightly more than half of the variance and the second approximately 16%.
sh; sh_std
## Importance of components:
## PC1 PC2 PC3 PC4 PC5 PC6 PC7 PC8
## Standard deviation 1.854e+04 185.5219 25.19 11.45 3.766 1.566 0.1912 0.1591
## Proportion of Variance 9.999e-01 0.0001 0.00 0.00 0.000 0.000 0.0000 0.0000
## Cumulative Proportion 9.999e-01 1.0000 1.00 1.00 1.000 1.000 1.0000 1.0000
## Importance of components:
## PC1 PC2 PC3 PC4 PC5 PC6 PC7
## Standard deviation 2.0708 1.1397 0.87505 0.77886 0.66196 0.53631 0.45900
## Proportion of Variance 0.5361 0.1624 0.09571 0.07583 0.05477 0.03595 0.02634
## Cumulative Proportion 0.5361 0.6984 0.79413 0.86996 0.92473 0.96069 0.98702
## PC8
## Standard deviation 0.32224
## Proportion of Variance 0.01298
## Cumulative Proportion 1.00000
We can also attempt to interpret the principal components. I focus on the PCA of the standardized data (second figure above). Life expectancy and expected years of education (as well as the female-to-male education ratio) increase toward the countries clustered on the left of the space. Similarly, the share of female representatives in parliament increases toward the top left corner. Many countries on the left side seem to be European or East Asian, along with a couple of wealthy countries from the Arabian peninsula; the top left corner holds mostly Scandinavian countries. Countries clustered at the right end of the space are mostly from the African continent. In these countries, both maternal mortality and adolescent birth rate are high, two phenomena that probably go hand in hand.
Finally, my personal interpretation of the first two principal components of the standardized dataset: the first principal component seems to represent life expectancy and the variables that move with it, while the second seems to roughly represent gender inequality.
In the final part of this exercise, we investigate a dataset that consists of answers to a questionnaire about tea consumption.
tea <- read.table("https://raw.githubusercontent.com/KimmoVehkalahti/Helsinki-Open-Data-Science/master/datasets/tea.csv",
sep = ",", header = T)
str(tea); dim(tea)
## 'data.frame': 300 obs. of 36 variables:
## $ breakfast : chr "breakfast" "breakfast" "Not.breakfast" "Not.breakfast" ...
## $ tea.time : chr "Not.tea time" "Not.tea time" "tea time" "Not.tea time" ...
## $ evening : chr "Not.evening" "Not.evening" "evening" "Not.evening" ...
## $ lunch : chr "Not.lunch" "Not.lunch" "Not.lunch" "Not.lunch" ...
## $ dinner : chr "Not.dinner" "Not.dinner" "dinner" "dinner" ...
## $ always : chr "Not.always" "Not.always" "Not.always" "Not.always" ...
## $ home : chr "home" "home" "home" "home" ...
## $ work : chr "Not.work" "Not.work" "work" "Not.work" ...
## $ tearoom : chr "Not.tearoom" "Not.tearoom" "Not.tearoom" "Not.tearoom" ...
## $ friends : chr "Not.friends" "Not.friends" "friends" "Not.friends" ...
## $ resto : chr "Not.resto" "Not.resto" "resto" "Not.resto" ...
## $ pub : chr "Not.pub" "Not.pub" "Not.pub" "Not.pub" ...
## $ Tea : chr "black" "black" "Earl Grey" "Earl Grey" ...
## $ How : chr "alone" "milk" "alone" "alone" ...
## $ sugar : chr "sugar" "No.sugar" "No.sugar" "sugar" ...
## $ how : chr "tea bag" "tea bag" "tea bag" "tea bag" ...
## $ where : chr "chain store" "chain store" "chain store" "chain store" ...
## $ price : chr "p_unknown" "p_variable" "p_variable" "p_variable" ...
## $ age : int 39 45 47 23 48 21 37 36 40 37 ...
## $ sex : chr "M" "F" "F" "M" ...
## $ SPC : chr "middle" "middle" "other worker" "student" ...
## $ Sport : chr "sportsman" "sportsman" "sportsman" "Not.sportsman" ...
## $ age_Q : chr "35-44" "45-59" "45-59" "15-24" ...
## $ frequency : chr "1/day" "1/day" "+2/day" "1/day" ...
## $ escape.exoticism: chr "Not.escape-exoticism" "escape-exoticism" "Not.escape-exoticism" "escape-exoticism" ...
## $ spirituality : chr "Not.spirituality" "Not.spirituality" "Not.spirituality" "spirituality" ...
## $ healthy : chr "healthy" "healthy" "healthy" "healthy" ...
## $ diuretic : chr "Not.diuretic" "diuretic" "diuretic" "Not.diuretic" ...
## $ friendliness : chr "Not.friendliness" "Not.friendliness" "friendliness" "Not.friendliness" ...
## $ iron.absorption : chr "Not.iron absorption" "Not.iron absorption" "Not.iron absorption" "Not.iron absorption" ...
## $ feminine : chr "Not.feminine" "Not.feminine" "Not.feminine" "Not.feminine" ...
## $ sophisticated : chr "Not.sophisticated" "Not.sophisticated" "Not.sophisticated" "sophisticated" ...
## $ slimming : chr "No.slimming" "No.slimming" "No.slimming" "No.slimming" ...
## $ exciting : chr "No.exciting" "exciting" "No.exciting" "No.exciting" ...
## $ relaxing : chr "No.relaxing" "No.relaxing" "relaxing" "relaxing" ...
## $ effect.on.health: chr "No.effect on health" "No.effect on health" "No.effect on health" "No.effect on health" ...
## [1] 300 36
Nearly all the variables are categorical (age is stored as an integer, but is treated as categorical here as well). Thus, we convert them all to factors.
col_names <- names(tea)
tea[,col_names] <- lapply(tea[,col_names] , factor)
str(tea)
## 'data.frame': 300 obs. of 36 variables:
## $ breakfast : Factor w/ 2 levels "breakfast","Not.breakfast": 1 1 2 2 1 2 1 2 1 1 ...
## $ tea.time : Factor w/ 2 levels "Not.tea time",..: 1 1 2 1 1 1 2 2 2 1 ...
## $ evening : Factor w/ 2 levels "evening","Not.evening": 2 2 1 2 1 2 2 1 2 1 ...
## $ lunch : Factor w/ 2 levels "lunch","Not.lunch": 2 2 2 2 2 2 2 2 2 2 ...
## $ dinner : Factor w/ 2 levels "dinner","Not.dinner": 2 2 1 1 2 1 2 2 2 2 ...
## $ always : Factor w/ 2 levels "always","Not.always": 2 2 2 2 1 2 2 2 2 2 ...
## $ home : Factor w/ 2 levels "home","Not.home": 1 1 1 1 1 1 1 1 1 1 ...
## $ work : Factor w/ 2 levels "Not.work","work": 1 1 2 1 1 1 1 1 1 1 ...
## $ tearoom : Factor w/ 2 levels "Not.tearoom",..: 1 1 1 1 1 1 1 1 1 2 ...
## $ friends : Factor w/ 2 levels "friends","Not.friends": 2 2 1 2 2 2 1 2 2 2 ...
## $ resto : Factor w/ 2 levels "Not.resto","resto": 1 1 2 1 1 1 1 1 1 1 ...
## $ pub : Factor w/ 2 levels "Not.pub","pub": 1 1 1 1 1 1 1 1 1 1 ...
## $ Tea : Factor w/ 3 levels "black","Earl Grey",..: 1 1 2 2 2 2 2 1 2 1 ...
## $ How : Factor w/ 4 levels "alone","lemon",..: 1 3 1 1 1 1 1 3 3 1 ...
## $ sugar : Factor w/ 2 levels "No.sugar","sugar": 2 1 1 2 1 1 1 1 1 1 ...
## $ how : Factor w/ 3 levels "tea bag","tea bag+unpackaged",..: 1 1 1 1 1 1 1 1 2 2 ...
## $ where : Factor w/ 3 levels "chain store",..: 1 1 1 1 1 1 1 1 2 2 ...
## $ price : Factor w/ 6 levels "p_branded","p_cheap",..: 4 6 6 6 6 3 6 6 5 5 ...
## $ age : Factor w/ 61 levels "15","17","18",..: 24 30 32 8 33 6 22 21 25 22 ...
## $ sex : Factor w/ 2 levels "F","M": 2 1 1 2 2 2 2 1 2 2 ...
## $ SPC : Factor w/ 7 levels "employee","middle",..: 2 2 4 6 1 6 5 2 5 5 ...
## $ Sport : Factor w/ 2 levels "Not.sportsman",..: 2 2 2 1 2 2 2 2 2 1 ...
## $ age_Q : Factor w/ 5 levels "+60","15-24",..: 4 5 5 2 5 2 4 4 4 4 ...
## $ frequency : Factor w/ 4 levels "+2/day","1 to 2/week",..: 3 3 1 3 1 3 4 2 1 1 ...
## $ escape.exoticism: Factor w/ 2 levels "escape-exoticism",..: 2 1 2 1 1 2 2 2 2 2 ...
## $ spirituality : Factor w/ 2 levels "Not.spirituality",..: 1 1 1 2 2 1 1 1 1 1 ...
## $ healthy : Factor w/ 2 levels "healthy","Not.healthy": 1 1 1 1 2 1 1 1 2 1 ...
## $ diuretic : Factor w/ 2 levels "diuretic","Not.diuretic": 2 1 1 2 1 2 2 2 2 1 ...
## $ friendliness : Factor w/ 2 levels "friendliness",..: 2 2 1 2 1 2 2 1 2 1 ...
## $ iron.absorption : Factor w/ 2 levels "iron absorption",..: 2 2 2 2 2 2 2 2 2 2 ...
## $ feminine : Factor w/ 2 levels "feminine","Not.feminine": 2 2 2 2 2 2 2 1 2 2 ...
## $ sophisticated : Factor w/ 2 levels "Not.sophisticated",..: 1 1 1 2 1 1 1 2 2 1 ...
## $ slimming : Factor w/ 2 levels "No.slimming",..: 1 1 1 1 1 1 1 1 1 1 ...
## $ exciting : Factor w/ 2 levels "exciting","No.exciting": 2 1 2 2 2 2 2 2 2 2 ...
## $ relaxing : Factor w/ 2 levels "No.relaxing",..: 1 1 2 2 2 2 2 2 2 2 ...
## $ effect.on.health: Factor w/ 2 levels "effect on health",..: 2 2 2 2 2 2 2 2 2 2 ...
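R's factor() essentially maps each distinct string to an integer level code. For comparison, a similar encoding in numpy on a hypothetical tea column (note that R's codes are 1-based while these are 0-based, and both sort the levels alphabetically):

```python
import numpy as np

# A made-up column resembling the Tea variable.
tea_col = np.array(["black", "Earl Grey", "Earl Grey", "green", "black"])

# Distinct levels (sorted) and the integer code for each observation,
# analogous to levels() and as.numeric() on an R factor.
levels, codes = np.unique(tea_col, return_inverse=True)

print(levels.tolist())  # ['Earl Grey', 'black', 'green']
print(codes.tolist())   # [1, 0, 0, 2, 1]
```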
# visualize the dataset
library(dplyr)
library(tidyr)
library(ggplot2)
vis <- pivot_longer(tea, cols = everything()) %>%
ggplot(aes(value)) +
facet_wrap("name", scales = "free") +
geom_bar() +
theme(axis.text.x = element_text(angle = 45, hjust = 1, size = 0))
vis
I could not fit the visualization of all 36 variables on this page in a way that keeps the bar labels readable. Thus, I select a few of the variables and plot them again with labels. These selected variables are also used for the MCA shortly.
library(dplyr)
library(tidyr)
library(ggplot2)
# column names to keep in the dataset
keep <- c("Tea", "How", "how", "sugar", "where", "lunch", "breakfast")
# select the 'keep_columns' to create a new dataset
tea_time <- dplyr::select(tea, one_of(keep))
# visualize the dataset
vis_ <- pivot_longer(tea_time, cols = everything()) %>%
ggplot(aes(value)) +
facet_wrap("name", scales = "free") +
geom_bar() +
theme(axis.text.x = element_text(angle = 45, hjust = 1, size = 8))
vis_
library(FactoMineR)
mca <- MCA(tea_time, graph = FALSE)
# visualize MCA
plot(mca, invisible=c("ind"), graph.type = "classic", habillage = "quali")
In the visualization above, different colors correspond to different variables in the data. Black indicates the type of tea, pink where the tea was purchased, green how the tea is packaged, red how it is enjoyed (with milk, lemon, other, or alone), brown whether the tea is consumed at lunch, and grey whether it is enjoyed at breakfast. From the MCA results, we can observe, for instance, that green tea is rather typically bought unpackaged from a tea shop, whereas Earl Grey and black tea are more typically bought in tea bags from a chain store. Earl Grey seems to be enjoyed more with milk and sugar, whereas black tea is enjoyed with lemon. It seems that Earl Grey is a good breakfast tea, especially with milk, whereas green tea taken alone may be better suited outside of lunch or breakfast.